Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[auto-techsupport] support techsupport generation on potential memory… #939

Merged
merged 9 commits into from
Mar 20, 2022

Conversation

stepanblyschak
Copy link
Contributor

@stepanblyschak stepanblyschak commented Feb 4, 2022

… leaks

PR title state context
[auto-ts] add memory check GitHub issue/pull request detail GitHub pull request check context
[auto-ts] add memory check GitHub issue/pull request detail GitHub pull request check context

@qiluo-msft qiluo-msft self-requested a review March 7, 2022 17:28

the SONiC techsupport is automatically generated.

```mem_free_threshold``` is there to invoke dump when there is quite small amount of memory left that is needed to successfully execute "show techsupport". This is going to be 200 Mb by default, as at least 80-90 Mb takes "show techsupport" execution.
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mem_free_threshold

If suddenly mem_free_threshold < 80 MB, show techsupport will always fail. This invocation will make situation worse. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiluo-msft If user configures too low mem_free_thrshold the "show techsupport" will fail. Currently there is no upper limit of "show techsupport" peak memory consumption, which will make it hard to verify the user does not configure this threshold too low.

This invocation will make situation worse.

It is assumed that the situation is already critical and we try to take the dump as best effort.

BTW, the swap partition would make it at least possible to execute some commands, however, it is not configured in SONiC.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for misleading. I am asking if the free memory suddenly drop to below 80 MB, show techsupport will always fail. This invocation will make situation worse.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiluo-msft Do you mean that if suddenly an application perfroms a very big allocation so that we won't be able to perform "show techsupport". A similar situation might happen when the process leaks memory very quickly so that we don't have enough memory to inoke "show techsupport". In this design the dump collection is best effort, so we don't guaranty that the dump will be actually generated because of that. It is also assumed that if the system is running with that low free memory the situation is already critical (very likelly that reboot will happen) so that additional "show tech" does not help or make it worth, it is making the reboot happen earlier. If the user does want the system to stay with a very low free memory than the user has to disable this feature. Otherwise I don't see a way to handle such a situation.

if status != 0 for 10 times within 20 cycles then exec /usr/local/bin/mem_threshold_check_handler"
```

The action is going to be ran only once the mem_check script detects memory usage above threshold.
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

once

Do you mean once after each bootup? Considering some fluctuations and finally a memory exhausting, you will keep only the first fluctuation "show techsupport" file? #Closed

Copy link
Contributor Author

@stepanblyschak stepanblyschak Mar 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiluo-msft I mean that the techsupport will run once if the script returns non zero status code for 10 times within 20 cycles. If memory usage is constantly too high for e.g 100 cycles, only 1 dump will be taken. If lets say the memory usage fluctuates, in example, 10 cycles too high in a row and then goes down for 10 cycles and then again up - there is a protection against too frequent auto tech dumps already present in the auto tech infrastructure.


the SONiC techsupport is automatically generated.

```mem_free_threshold``` is there to invoke dump when there is quite small amount of memory left that is needed to successfully execute "show techsupport". This is going to be 200 Mb by default, as at least 80-90 Mb takes "show techsupport" execution.
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mb

Mb -> MB #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiluo-msft updated in next patch


### Config DB

#### AUTO_TECHSUPPORT Table
```
key = "AUTO_TECHSUPPORT|global"
state = "enabled" / "disabled" ; Enable this to make the Techsupport Invocation event driven based on core-dump generation
available_mem_threshold = 1*2DIGIT ; Memory threshold; 0 to disable techsupport invocation on mem leak.
min-available-mem = 1*5DIGIT ; Minimum free memory amount in Kb when techsupport will be executed.
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kb

Kb -> KB #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiluo-msft updated in next patch


### Config DB

#### AUTO_TECHSUPPORT Table
```
key = "AUTO_TECHSUPPORT|global"
state = "enabled" / "disabled" ; Enable this to make the Techsupport Invocation event driven based on core-dump generation
available_mem_threshold = 1*2DIGIT ; Memory threshold; 0 to disable techsupport invocation on mem leak.
min-available-mem = 1*5DIGIT ; Minimum free memory amount in Kb when techsupport will be executed.
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use underscore instead of dash? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiluo-msft Right, updated in next patch


### Config DB

#### AUTO_TECHSUPPORT Table
```
key = "AUTO_TECHSUPPORT|global"
state = "enabled" / "disabled" ; Enable this to make the Techsupport Invocation event driven based on core-dump generation
available_mem_threshold = 1*2DIGIT ; Memory threshold; 0 to disable techsupport invocation on mem leak.
min-available-mem = 1*5DIGIT ; Minimum free memory amount in Kb when techsupport will be executed.
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kb

The same #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If always > 1MB, use MB as unit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiluo-msft Agree, updated in next patch

@@ -226,27 +298,43 @@ module sonic-auto_techsupport {
#### AUTO_TECHSUPPORT_DUMP_INFO Table
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Table

Is there an entry number limit in this table? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiluo-msft There is a rate limiting mechanism already available in auto techsupport infrastructure that limits the number of dumps that can be generated.

@@ -36,13 +37,20 @@ Currently, techsupport is run by invoking `show techsupport` either by orchestra

However if the techsupport invocation can be made event-driven based on core dump generation, that would definitely improve the debuggability. That is the overall idea behind this HLD. All the high-level requirements are summarized in the next section

Another use case is to gather more information about the system in case there is a memory usage threshold crossed.
SONiC dump generated after system reboots due to out of memory is not enough for debugging the issue
as all the information about processes and their mem usage, smaps (/proc/PID/smaps) is lost.
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lost

Instead of invoking a full show techsupport, can we just save volatile information before reboot? For example, it makes no sense to save syslog during the memory pressure time. We could collect them after reboot. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiluo-msft The user can configure the "since" parameter in CFG_DB and achive that result. If "since" is set to "0 sec" that would mean that no logs, sai, swss recs are recorded.

if status != 0 for 10 times within 20 cycles then exec /usr/local/bin/mem_threshold_check_handler"
```

The action is going to be ran only once the mem_check script detects memory usage above threshold.
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

threshold

If auto invoking too many times, do we have file rotation? Do we only rotate the auto dump files and keep user manual dump? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiluo-msft There are "rate-limit-interval", "max-techsupport-limit" parameters in AUTO_TECHSUPPORT table which are already part of auto tech infrastructure that limit the number of dumps. This is only applicable for automatically generated dumps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Is auto rotation covered by design?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@liat-grozovik
Copy link
Collaborator

@stepanblyschak please update PR description with the PRs list covering this feature.

@liat-grozovik liat-grozovik merged commit 61e6c87 into sonic-net:master Mar 20, 2022
liat-grozovik pushed a commit to sonic-net/sonic-utilities that referenced this pull request May 14, 2022
- What I did
Implemented memory threashold check in auto techsupport feature according to sonic-net/SONiC#939.

- How I did it
Added two scripts. The check script and the handler script. Few modifications made in auto tech implementation. UT added.

- How to verify it
Run the action script and the handler script on the switch. Run UT.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
qiluo-msft pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Jun 20, 2022
#### Why I did it

To support automatic techsupport invokation in case memory usage is too high.

#### How I did it

Implemented according to sonic-net/SONiC#939

#### How to verify it

UT, manual test on the switch.

*DEPENDS* on sonic-net/sonic-utilities#2116
liat-grozovik pushed a commit to sonic-net/sonic-utilities that referenced this pull request Oct 2, 2022
Backport of #2116

- What I did
Implemented memory threashold check in auto techsupport feature according to sonic-net/SONiC#939.

- How I did it
Added two scripts. The check script and the handler script. Few modifications made in auto tech implementation. UT added.

- How to verify it
Run the action script and the handler script on the switch. Run UT.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
yxieca pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Oct 3, 2022
#### Why I did it

To support automatic techsupport invokation in case memory usage is too high.

#### How I did it

Implemented according to sonic-net/SONiC#939

#### How to verify it

UT, manual test on the switch.

*DEPENDS* on sonic-net/sonic-utilities#2116
stepanblyschak added a commit to stepanblyschak/sonic-buildimage that referenced this pull request Oct 6, 2022
#### Why I did it

To support automatic techsupport invokation in case memory usage is too high.

#### How I did it

Implemented according to sonic-net/SONiC#939

#### How to verify it

UT, manual test on the switch.

*DEPENDS* on sonic-net/sonic-utilities#2116
yxieca pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Oct 6, 2022
#### Why I did it

To support automatic techsupport invokation in case memory usage is too high.

#### How I did it

Implemented according to sonic-net/SONiC#939

#### How to verify it

UT, manual test on the switch.

*DEPENDS* on sonic-net/sonic-utilities#2116
malletvapid23 added a commit to malletvapid23/Sonic-Utility that referenced this pull request Aug 3, 2023
- What I did
Implemented memory threashold check in auto techsupport feature according to sonic-net/SONiC#939.

- How I did it
Added two scripts. The check script and the handler script. Few modifications made in auto tech implementation. UT added.

- How to verify it
Run the action script and the handler script on the switch. Run UT.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants